Abstract

Word2vec [2], an efficient tool for learning vector representations of
words, has shown its effectiveness in many natural language processing
tasks. Mikolov et al. proposed the Skip-Gram and Negative Sampling models [3]
on which this toolbox is built. Perozzi et al. were the first to introduce the
Skip-Gram model into the study of social networks, and designed an
algorithm named DeepWalk [6] for learning node embeddings on a graph.
We prove that the DeepWalk algorithm is actually factorizing a matrix $M$
where each entry $M_{ij}$ is the logarithm of the average probability that
node $i$ randomly walks to node $j$ in a fixed number of steps. We explain
this in Section 3.


1 Notation 

Network $G = (V, E)$. The node-context set $D$ is generated from random walks,
where each element of $D$ is a node-context pair $(v, c)$. $V$ is the set of
nodes and $V_C$ is the set of context nodes. In most cases, $V = V_C$.
Consider a node-context pair (v,c): 

$\#(v, c)$ denotes the number of times $(v, c)$ appears in $D$.
$\#(v) = \sum_{c' \in V_C} \#(v, c')$ and $\#(c) = \sum_{v' \in V} \#(v', c)$
denote the number of times $v$ and $c$ appear in $D$, respectively.

Note that $|D| = \sum_{v' \in V} \sum_{c' \in V_C} \#(v', c')$.

The DeepWalk algorithm embeds a node $v$ into a $d$-dimensional vector
$\vec{v} \in \mathbb{R}^d$. Likewise, a context node $c \in V_C$ is represented
by a $d$-dimensional vector $\vec{c} \in \mathbb{R}^d$. Let $W$ be a
$|V| \times d$ matrix where row $i$ is the vector $\vec{v}_i$, and let $H$ be a
$|V_C| \times d$ matrix where row $j$ is the vector $\vec{c}_j$. Our goal is to
figure out the matrix $M = W H^T$.
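
The counting notation above can be illustrated with a minimal sketch. The toy pair list `D` and the variable names are assumptions chosen for illustration only; they do not come from the DeepWalk implementation.

```python
from collections import Counter

# Hypothetical toy node-context set D (pairs invented for illustration).
D = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("b", "a")]

pair_count = Counter(D)                     # #(v, c)
node_count = Counter(v for v, _ in D)       # #(v) = sum over c' of #(v, c')
context_count = Counter(c for _, c in D)    # #(c) = sum over v' of #(v', c)
size_D = sum(pair_count.values())           # |D| = sum over all pairs of #(v', c')

print(pair_count[("a", "b")], node_count["a"], context_count["a"], size_D)
```

Because every pair in `D` contributes exactly once to each count, `size_D` equals `len(D)`.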

2 Proof 

Perozzi et al. implemented the DeepWalk algorithm with the Skip-Gram and
Hierarchical Softmax models. Note that Hierarchical Softmax is a variant of
softmax designed to reduce training time. In this section, we give proofs for
the Skip-Gram model with both Negative Sampling and softmax.


2.1 Negative Sampling 

Negative Sampling approximately maximizes the probability of the softmax
function by randomly choosing $k$ negative samples from the context set. Levy
and Goldberg showed that the Skip-Gram with Negative Sampling model (SGNS)
implicitly factorizes a word-context matrix [1], assuming that the
dimensionality $d$ is sufficiently large. In other words, we can assign each
product $\vec{v} \cdot \vec{c}$ a value independently of the others.

In the SGNS model, we have

$$P((v, c) \in D) = \sigma(\vec{v} \cdot \vec{c}) = \frac{1}{1 + e^{-\vec{v} \cdot \vec{c}}}$$

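Suppose we choose $k$ negative samples for each node-context pair $(v, c)$
according to the distribution $P_D(c_N) = \frac{\#(c_N)}{|D|}$. Then the
objective function for SGNS can be written as

$$
\begin{aligned}
l &= \sum_{v \in V} \sum_{c \in V_C} \#(v, c) \left( \log \sigma(\vec{v} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D} \left[ \log \sigma(-\vec{v} \cdot \vec{c}_N) \right] \right) \\
  &= \sum_{v \in V} \sum_{c \in V_C} \#(v, c) \log \sigma(\vec{v} \cdot \vec{c}) + k \sum_{v \in V} \#(v) \sum_{c_N \in V_C} \frac{\#(c_N)}{|D|} \log \sigma(-\vec{v} \cdot \vec{c}_N) \\
  &= \sum_{v \in V} \sum_{c \in V_C} \left( \#(v, c) \log \sigma(\vec{v} \cdot \vec{c}) + k \cdot \#(v) \cdot \frac{\#(c)}{|D|} \log \sigma(-\vec{v} \cdot \vec{c}) \right)
\end{aligned}
$$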


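Denote $x = \vec{v} \cdot \vec{c}$. By solving $\frac{\partial l}{\partial x} = 0$, we have

$$\vec{v} \cdot \vec{c} = x = \log \frac{\#(v, c) \cdot |D|}{\#(v) \cdot \#(c)} - \log k$$

Thus we have $M_{ij} = \log \frac{\#(v_i, c_j) \cdot |D|}{\#(v_i) \cdot \#(c_j)} - \log k$.
$M_{ij}$ can be interpreted as the Pointwise Mutual Information (PMI) of the
node-context pair $(v_i, c_j)$ shifted by $\log k$.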
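
As a numerical illustration of this result, the following sketch computes the shifted PMI value for every observed pair of a toy node-context set and provides the derivative of the per-pair objective so the stationary point can be checked. The data and the function names `shifted_pmi` and `sgns_gradient` are hypothetical, and unobserved pairs (for which the logarithm is undefined) are simply skipped.

```python
import math
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shifted_pmi(pairs, k):
    """Return {(v, c): log(#(v,c)|D| / (#(v)#(c))) - log k} for observed pairs."""
    pair_count = Counter(pairs)
    node_count = Counter(v for v, _ in pairs)
    context_count = Counter(c for _, c in pairs)
    size_d = len(pairs)
    return {
        (v, c): math.log(n * size_d / (node_count[v] * context_count[c])) - math.log(k)
        for (v, c), n in pair_count.items()
    }

def sgns_gradient(x, n_vc, n_v, n_c, size_d, k):
    """Derivative in x of #(v,c) log s(x) + k #(v) #(c)/|D| log s(-x)."""
    return n_vc * sigmoid(-x) - k * n_v * n_c / size_d * sigmoid(x)

# Hypothetical toy node-context set (invented for illustration).
D = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("b", "a")]
M = shifted_pmi(D, k=2)
print(M[("a", "b")])
print(sgns_gradient(M[("a", "b")], n_vc=2, n_v=3, n_c=2, size_d=6, k=2))
```

For the pair `("a", "b")` the ratio inside the logarithm is 2, so with `k=2` the shifted PMI is zero up to floating-point error, and the derivative vanishes there.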

2.2 Softmax

Since both Negative Sampling and Hierarchical Softmax are variants of
softmax, we pay more attention to the softmax model and give a further
discussion in the next section. We again assume that the values of
$\vec{v} \cdot \vec{c}$ are independent.

In the softmax model,

$$P((v, c) \in D) = \frac{e^{\vec{v} \cdot \vec{c}}}{\sum_{c' \in V_C} e^{\vec{v} \cdot \vec{c}'}}$$


And the objective function is

$$l = \sum_{v \in V} \sum_{c \in V_C} \#(v, c) \cdot \log \frac{e^{\vec{v} \cdot \vec{c}}}{\sum_{c' \in V_C} e^{\vec{v} \cdot \vec{c}'}}$$

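After extracting all terms associated with $\vec{v} \cdot \vec{c}$ as $l(v, c)$, we have

$$
l(v, c) = \#(v, c) \log \frac{e^{\vec{v} \cdot \vec{c}}}{\sum_{c' \in V_C, c' \neq c} e^{\vec{v} \cdot \vec{c}'} + e^{\vec{v} \cdot \vec{c}}}
+ \sum_{\tilde{c} \in V_C, \tilde{c} \neq c} \#(v, \tilde{c}) \log \frac{e^{\vec{v} \cdot \vec{\tilde{c}}}}{\sum_{c' \in V_C, c' \neq c} e^{\vec{v} \cdot \vec{c}'} + e^{\vec{v} \cdot \vec{c}}}
$$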
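Note that $l \neq \sum_{v \in V} \sum_{c \in V_C} l(v, c)$, since terms of $l$
that involve several products are counted in more than one $l(v, c)$. Denote
$x = \vec{v} \cdot \vec{c}$. By solving $\frac{\partial l}{\partial x} = 0$
for all such $x$, we have

$$\vec{v} \cdot \vec{c} = x = \log \frac{\#(v, c)}{\#(v)} + b_v$$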

where $b_v$ can be any real constant, since it cancels when we compute
$P((v, c) \in D)$. Thus we have $M_{ij} = \log \frac{\#(v_i, c_j)}{\#(v_i)} + b_{v_i}$.
We will discuss what $M_{ij}$ represents in the next section.
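
The claim that $b_v$ cancels can be checked numerically. The sketch below is an illustration under assumed toy counts for a single node `"a"`; the helper names `softmax_row` and `row_scores` are hypothetical. It sets the scores to $\log \frac{\#(v,c)}{\#(v)} + b_v$ and shows that the softmax returns the empirical conditional frequencies for two different choices of $b_v$.

```python
import math
from collections import Counter

def softmax_row(scores):
    """Softmax over a dict {context: score}, shifted by the max for stability."""
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Hypothetical pairs for one node "a" (invented for illustration).
D = [("a", "b"), ("a", "b"), ("a", "c")]
pair_count = Counter(D)
n_a = sum(pair_count.values())              # #(a)

def row_scores(b_v):
    """Scores log(#(a,c)/#(a)) + b_v for every observed context c of node a."""
    return {c: math.log(n / n_a) + b_v for (_, c), n in pair_count.items()}

p0 = softmax_row(row_scores(0.0))
p1 = softmax_row(row_scores(5.0))
print(p0, p1)
```

Both `p0` and `p1` agree with the empirical frequencies $\#(a,c)/\#(a)$, which is why the constant can be chosen freely per node.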


3 Discussion 

It is clear that the method of sampling node-context pairs affects the matrix
$M$. In this section, we discuss $\frac{\#(v)}{|D|}$ and
$\frac{\#(v, c)}{\#(v)}$ based on an ideal sampling method for the DeepWalk
algorithm.

Assume the graph is connected and undirected and the window size is $t$. We
can easily generalize this sampling method to directed graphs by adding only
$(RW_i, RW_j)$ into $D$.


Algorithm 1 Ideal node-context pair sampling algorithm
  Generate an infinitely long random walk RW.
  Denote RW_i as the node at position i of RW, where i = 0, 1, 2, ...
  for i = 0, 1, 2, ... do
    for j ∈ [i + 1, i + t] do
      add (RW_i, RW_j) into D
      add (RW_j, RW_i) into D
    end for
  end for
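
Algorithm 1 assumes an infinitely long walk, which cannot be executed directly. The following is a finite-length sketch of the same loop, under the assumptions that the graph is given as an adjacency-list dictionary and that the walk is truncated after `walk_length` nodes; the function name `sample_pairs` and its parameters are hypothetical.

```python
import random

def sample_pairs(adj, start, walk_length, t, directed=False, seed=0):
    """Truncated version of Algorithm 1: returns the walk and the pair list D."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(walk_length - 1):
        # Move to a uniformly chosen neighbor of the current node.
        walk.append(rng.choice(adj[walk[-1]]))
    pairs = []
    for i in range(len(walk)):
        # j ranges over [i + 1, i + t], clipped at the end of the finite walk.
        for j in range(i + 1, min(i + t, len(walk) - 1) + 1):
            pairs.append((walk[i], walk[j]))
            if not directed:
                pairs.append((walk[j], walk[i]))
    return walk, pairs

# Hypothetical toy graph: a triangle.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
walk, D = sample_pairs(adj, "a", walk_length=50, t=2)
print(len(walk), len(D))
```

Nodes within the last `t` positions of a truncated walk contribute fewer pairs than interior nodes, so the counts only approximate the ideal algorithm; the effect shrinks as `walk_length` grows.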


Each appearance of node $i$ is recorded $2t$ times in $D$ for an undirected
graph and $t$ times for a directed graph. Thus $\frac{\#(v_i)}{|D|}$ is the
frequency with which $v_i$ appears in the random walk, which is exactly the
PageRank value of $v_i$. Also note that $\frac{2t \cdot \#(v_i, v_j)}{\#(v_i)}$
is the expected number of times that $v_j$ is observed among the left/right
$t$ neighbors of $v_i$.

Denote the transition matrix in the PageRank algorithm by $A$. More formally,
let $d_i$ be the degree of node $i$. $A_{ij} = \frac{1}{d_i}$ if $(i, j) \in E$
and $A_{ij} = 0$ otherwise. We use $e_i$ to denote a $|V|$-dimensional row
vector where all entries are zero except the $i$-th entry, which is 1.

Suppose that we start a random walk from node $i$ and use $e_i$ to denote the
initial state. Then $e_i A$ is the distribution over all the nodes, where the
$j$-th entry is the probability that node $i$ walks to node $j$. Hence the
$j$-th entry of $e_i A^t$ is the probability that node $i$ walks to node $j$ in
exactly $t$ steps. Thus $[e_i (A + A^2 + \cdots + A^t)]_j$ is the expected
number of times that $v_j$ appears among the right $t$ neighbors of $v_i$.
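
A small sketch of this computation in plain Python, assuming a simple graph stored as an adjacency-list dictionary and $t \geq 1$; the helper names `transition_matrix`, `matmul` and `power_sum` are hypothetical and use dense lists of lists for readability rather than efficiency.

```python
def transition_matrix(adj, nodes):
    """A[i][j] = 1 / d_i if (i, j) is an edge, else 0 (simple graph assumed)."""
    index = {v: i for i, v in enumerate(nodes)}
    A = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        for u in adj[v]:
            A[index[v]][index[u]] = 1.0 / len(adj[v])
    return A

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def power_sum(A, t):
    """Return A + A^2 + ... + A^t for t >= 1."""
    n = len(A)
    power = [row[:] for row in A]   # current power A^s, starting at s = 1
    total = [row[:] for row in A]   # running sum A + ... + A^s
    for _ in range(t - 1):
        power = matmul(power, A)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical toy graph: the path a - b - c.
nodes = ["a", "b", "c"]
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
A = transition_matrix(adj, nodes)
S = power_sum(A, t=2)
print(S[0])  # row e_a (A + A^2): [0.5, 1.0, 0.5]
```

Each row of `S` sums to `t`, because every one of the `t` steps contributes a full probability distribution.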

Hence

$$\frac{\#(v_i, v_j)}{\#(v_i) / 2t} = 2 \left[ e_i (A + A^2 + \cdots + A^t) \right]_j$$

$$\frac{\#(v_i, v_j)}{\#(v_i)} = \frac{\left[ e_i (A + A^2 + \cdots + A^t) \right]_j}{t}$$

This equality also holds for directed graphs.

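By setting $b_{v_i} = \log 2t$ for all $i$,
$M_{ij} = \log \frac{2t \cdot \#(v_i, v_j)}{\#(v_i)}$ is the logarithm of the
expected number of times that $v_j$ appears among the left/right $t$ neighbors
of $v_i$.

By setting $b_{v_i} = 0$ for all $i$,
$M_{ij} = \log \frac{\#(v_i, v_j)}{\#(v_i)} = \log \frac{[e_i (A + A^2 + \cdots + A^t)]_j}{t}$
is the logarithm of the average probability that node $i$ randomly walks to
node $j$ in $t$ steps.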
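
The identity can be checked empirically on a small graph. The sketch below is only an illustration under stated assumptions: the graph is a hypothetical toy example, the infinite walk of Algorithm 1 is replaced by a long finite walk with a fixed seed, and all names are invented. It compares the sampled ratio $\#(v_i, v_j) / \#(v_i)$ with the average walk probability and builds $M$ for $b_{v_i} = 0$.

```python
import math
import random
from collections import Counter

# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
nodes = sorted(adj)
t = 2

def average_walk_probability(start):
    """Return {j: [e_i (A + ... + A^t)]_j / t} for i = start."""
    dist = {v: 1.0 if v == start else 0.0 for v in nodes}
    total = {v: 0.0 for v in nodes}
    for _ in range(t):
        nxt = {v: 0.0 for v in nodes}
        for u, p in dist.items():
            for w in adj[u]:
                nxt[w] += p / len(adj[u])   # one multiplication by A
        dist = nxt
        for v in nodes:
            total[v] += dist[v]
    return {v: total[v] / t for v in nodes}

theory = {i: average_walk_probability(i) for i in nodes}

# Finite stand-in for the infinitely long walk of Algorithm 1.
rng = random.Random(0)
walk = ["a"]
for _ in range(200_000 - 1):
    walk.append(rng.choice(adj[walk[-1]]))

pair_count = Counter()
for i in range(len(walk)):
    for j in range(i + 1, min(i + t, len(walk) - 1) + 1):
        pair_count[(walk[i], walk[j])] += 1
        pair_count[(walk[j], walk[i])] += 1

node_count = Counter()
for (v, _), n in pair_count.items():
    node_count[v] += n                       # #(v) = sum over c of #(v, c)

empirical = {i: {j: pair_count[(i, j)] / node_count[i] for j in nodes} for i in nodes}

# M with b_{v_i} = 0; every entry is finite here because all pairs are reachable.
M = {i: {j: math.log(theory[i][j]) for j in nodes} for i in nodes}
worst = max(abs(empirical[i][j] - theory[i][j]) for i in nodes for j in nodes)
print(worst)
```

On graphs where some node $j$ cannot be reached from $i$ within $t$ steps, the corresponding entry of $M$ is $\log 0$ and would need separate handling.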